    Stochastic model for the vocabulary growth in natural languages

    We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words, which have higher frequency and do not affect the probability of a new word being used; and (ii) the remaining, virtually infinite number of noncore-words, which have lower frequency and, once used, reduce the probability of a new word being used in the future. Our model relies on a careful analysis of the Google-ngram database of books published in the last centuries, and its main consequence is the generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model, the main change on historical time scales is the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English. Comment: corrected typos and errors in reference list; 10 pages text, 15 pages supplemental material; to appear in Physical Review

    Scaling laws and fluctuations in the statistics of word frequencies

    In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), here we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words by simple stochastic processes in which the overall distribution of word frequencies is fat tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance, and the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results in three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles). Comment: 19 pages, 4 figures
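The sublinear vocabulary growth described above (Heaps' law) can be illustrated with a small synthetic experiment. The sketch below is not the authors' model: it assumes the simplest homogeneous case, in which tokens are drawn independently from a Zipf-like rank-frequency distribution, and the function names, vocabulary size, and exponent are hypothetical choices made only for illustration.

```python
import math
import random
from itertools import accumulate

def zipf_sample(n_tokens, n_types=50_000, exponent=1.0, seed=0):
    """Draw n_tokens word ids i.i.d. from a Zipf-like distribution.

    Assumption: word of rank r has weight r**(-exponent); real corpora
    also show topic-dependent fluctuations that this sketch ignores.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_types + 1)]
    cum_weights = list(accumulate(weights))
    return rng.choices(range(n_types), cum_weights=cum_weights, k=n_tokens)

def vocabulary_growth(tokens):
    """Return V(N) for N = 1..len(tokens): distinct words among the first N tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

def heaps_exponent(growth, n_small, n_large):
    """Estimate the local Heaps exponent as a log-log slope between two sizes."""
    v_small, v_large = growth[n_small - 1], growth[n_large - 1]
    return math.log(v_large / v_small) / math.log(n_large / n_small)

tokens = zipf_sample(20_000)
growth = vocabulary_growth(tokens)
slope = heaps_exponent(growth, 2_000, 20_000)
```

A slope below 1 means the vocabulary grows more slowly than the text itself. The inhomogeneous effects discussed in the abstract (reduced mean and increased variance of the vocabulary size) would require sampling with document-dependent word frequencies, which this homogeneous sketch deliberately leaves out.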

    A network approach to topic models

    One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success, in particular that of the most widely used variant, Latent Dirichlet Allocation (LDA), and their numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods, specifically a stochastic block model (SBM) with non-parametric priors, we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields. Comment: 22 pages, 10 figures, code available at https://topsbm.github.io

    Using text analysis to quantify the similarity and evolution of scientific disciplines

    We use an information-theoretic measure of linguistic similarity to investigate the organization and evolution of scientific fields. An analysis of almost 20M papers from the past three decades reveals that the linguistic similarity is related to, but different from, expert and citation-based classifications, leading to an improved view of the organization of science. A temporal analysis of the similarity of fields shows that some fields (e.g., computer science) are becoming increasingly central, but that on average the similarity between pairs has not changed in the last decades. This suggests that tendencies of convergence (e.g., multi-disciplinarity) and divergence (e.g., specialization) of disciplines are in balance. Comment: 9 pages, 4 figures
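The abstract above does not name its information-theoretic measure, so the following sketch uses the standard Jensen-Shannon divergence between word-frequency distributions purely as a stand-in. The whitespace tokenization and all function names are assumptions made for illustration, not the authors' pipeline.

```python
import math
from collections import Counter

def word_distribution(text):
    """Relative word frequencies of a text (naive lowercase whitespace tokens)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def entropy(dist):
    """Shannon entropy in bits of a {word: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = H((p + q) / 2) - (H(p) + H(q)) / 2, between 0 and 1 bit."""
    vocab = set(p) | set(q)
    mixture = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in vocab}
    return entropy(mixture) - 0.5 * entropy(p) - 0.5 * entropy(q)

# Toy "fields" represented by tiny made-up texts.
physics = word_distribution("energy field particle energy quantum field")
biology = word_distribution("cell protein gene cell energy protein")
distance = jensen_shannon_divergence(physics, biology)
```

A value of 0 means the two texts use words with identical frequencies, and 1 bit means they share no words at all; intermediate values give a graded dissimilarity that can be compared across pairs of fields or across time.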

    Extracting information from S-curves of language change

    It is well accepted that the adoption of innovations is described by S-curves (slow start, accelerating period, and slow end). In this paper, we analyze how much information on the dynamics of innovation spreading can be obtained from a quantitative description of S-curves. We focus on the adoption of linguistic innovations, for which detailed databases of written texts from the last 200 years allow for unprecedented statistical precision. Combining data analysis with simulations of simple models (e.g., the Bass dynamics on complex networks), we identify signatures of endogenous and exogenous factors in the S-curves of adoption. We propose a measure to quantify the strength of these factors and three different methods to estimate it from S-curves. We obtain cases in which the exogenous factors are dominant (in the adoption of German orthographic reforms and of one irregular verb) and cases in which endogenous factors are dominant (in the adoption of conventions for romanization of Russian names and in the regularization of most studied verbs). These results show that the shape of the S-curve is not universal and contains information on the adoption mechanism. (published at "J. R. Soc. Interface, vol. 11, no. 101, (2014) 1044"; DOI: http://dx.doi.org/10.1098/rsif.2014.1044) Comment: 9 pages, 5 figures, Supplementary Material is available at http://dx.doi.org/10.6084/m9.figshare.122178
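To see how exogenous and endogenous factors shape an adoption curve, the sketch below uses the textbook mean-field Bass model, dF/dt = (a + b F)(1 - F), where F is the adopted fraction. This is a simplification of the network simulations mentioned in the abstract, and the parameter names `a` (exogenous rate) and `b` (endogenous rate) are assumptions for illustration; the paper's own measure of factor strength is not reproduced here.

```python
import math

def bass_curve(t, a, b):
    """Closed-form solution of dF/dt = (a + b*F) * (1 - F) with F(0) = 0.

    a > 0 is the exogenous (external influence) rate and b >= 0 the
    endogenous (imitation) rate; a must be nonzero to avoid division by zero.
    """
    decay = math.exp(-(a + b) * t)
    return (1.0 - decay) / (1.0 + (b / a) * decay)

def euler_bass(a, b, t_end, dt=0.001):
    """Integrate the same equation with explicit Euler steps as a cross-check."""
    fraction = 0.0
    for _ in range(int(round(t_end / dt))):
        fraction += dt * (a + b * fraction) * (1.0 - fraction)
    return fraction

# Mostly endogenous adoption: slow start, then acceleration (S-shape).
endogenous = [bass_curve(t, a=0.01, b=0.5) for t in range(0, 31)]
# Purely exogenous adoption: F(t) = 1 - exp(-a*t), fastest at the very start.
exogenous = [bass_curve(t, a=0.2, b=0.0) for t in range(0, 31)]
```

With `b` much larger than `a` the curve has a pronounced slow start followed by rapid growth, whereas with `b = 0` growth is fastest at the beginning and there is no initial plateau. This difference in shape is the kind of signature that allows the two mechanisms to be distinguished from data.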

    On the influence of thermally induced radial pipe extension on the axial friction resistance

    Within the design process of district heating networks, the maximum friction forces between the pipeline and the surrounding soil are calculated from the radial stress state and the coefficient of contact friction. For the estimation of the radial stresses, the soil unit weight, geometric properties such as the pipe's diameter and the depth of embedment, as well as the groundwater level are taken into account. For the coefficient of contact friction, different values are proposed, depending on the thermal loading condition of the pipeline. Although this is an assumption of practical use, physically the coefficient of friction is a material constant. To revise the interaction behavior of the soil-pipeline system with respect to thermally induced radial pipe extension, a two-dimensional finite element model has been developed. Here, the frictional contact was established using Coulomb's friction law. For the embedment, sand at different states of relative density was considered. This noncohesive, granular material was described by the constitutive model HSsmall, which is able to predict the complex non-linear soil behavior in a realistic manner through stress-dependency of stiffness as well as isotropic frictional and volumetric hardening. In addition to the basic Hardening Soil model, the HSsmall model accounts for an increased stiffness in small strain regions, which is crucial for the presented investigation. After a model validation, a parametric study was carried out wherein a radial pipe displacement was applied due to thermal changes of the transported medium. Different combinations of geometry and soil property were studied. We conclude by presenting a corrective term that enables the incorporation of thermal expansion effects into the prediction of the maximum friction force.

    Business Intelligence & Analytics and Decision Quality - Insights on Analytics Specialization and Information Processing Modes

    Leveraging the benefits of business intelligence and analytics (BI&A) and improving decision quality depends not only on establishing BI&A technology, but also on the organization and characteristics of decision processes. This research investigates new perspectives on these decision processes and establishes a link between characteristics of BI&A support and decision makers’ modes of information processing behavior, and how these ultimately contribute to the quality of decision outcomes. We build on the heuristic–systematic model (HSM) of information processing as a central explanatory mechanism for linking BI&A support and decision quality. This allows us to examine the effects of decision makers’ systematic and heuristic modes of information processing behavior in decision making processes. We further elucidate the role of analytics experts in influencing decision makers’ utilization of analytic advice. The analysis of data from 136 BI&A-supported decisions reveals how high levels of analytics elaboration can have a negative effect on decision makers’ information processing behavior. We further show how decision makers’ systematic processing contributes to decision quality and how heuristic processing restrains it. In this context we also find that the trustworthiness of the analytics expert plays an important role in the adoption of analytic advice.